Introduction to Python with Application to Bioinformatics - Day 5¶

Day 5¶

  • Session 1
    • Quiz: Review of Day 4
    • Lecture: Go through questions
    • Lecture: Introduction to regex
    • Ex1: Find the pattern using regex
  • Session 2
    • Lecture: Regexp in Python
    • Ex2: Regexp using Python
    • PyQuiz 5.1 (10 min)
  • Session 3
    • Lecture: Sum up
    • Ex3: final exercise
  • Project time

Quiz: Review Day 4¶

Go to Canvas, Modules -> Day 5 -> Review Day 4

~15 minutes

1. What happens if you declare a variable with the same name inside and outside a function?¶

  • The variable inside the function has a separate scope and does not affect the one outside
In [4]:
name = "Max"
def changeName():
    name = "Niko"
    print(f"name inside the function: {name}, address = {id(name)}")
changeName()
print(f"name outside of the function: {name}, address = {id(name)}")
name inside the function: Niko, address = 4388913008
name outside of the function: Max, address = 4386289840

2. What is the difference between positional arguments and keyword arguments?¶

  • Keyword arguments can be given in any order, while positional arguments depend on the function's order

3. What will be the output of the following code snippet?¶

In [5]:
def add(x, y, z=0):
    return x + y + z
print(add(1, 2))
print(add(1, y=2, z=3))
3
6

4. Why is it beneficial to use docstrings in functions?¶

  • They provide explanations and details about the function for others reading your code

Both """ and ''' can be used for docstrings¶

In [9]:
def add(x, y, z=0):
    """
    Calculate the sum of up to three numbers.
    
    Parameters:
    x (int/float): The first number to be added.
    y (int/float): The second number to be added.
    z (int/float, optional): The third number to be added. Defaults to 0 if not provided.
    
    Returns:
    int/float: The sum of the numbers.
    """
    # Return the sum of the provided numbers, z is optional and defaults to 0 if not specified
    return x + y + z

5. How can you see the documentation of a Python library function in the console?¶

  • Use help(library.function)

6. Which of these import statements would avoid a name conflict if there’s a local variable math in the same script?¶

  • import math as m

7. What will happen if you import the same module multiple times in a Python script?¶

  • Python ignores subsequent imports of the same module in the same script

If you run

import myModule

and then update myModule and run import myModule again in a Jupyter notebook, the module will not be updated. You will need to run

from importlib import reload
reload(myModule)
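A minimal sketch of why the second import is ignored: Python keeps every imported module in the dictionary sys.modules and hands back that cached object on later imports. The standard library module math is used here only as a stand-in for your own module, and the variable name first_import is made up for the illustration.

```python
import sys
import math

# The module object that the first import stored in the cache
first_import = sys.modules["math"]

# A repeated import reuses the cached module; the file is not read again
import math

print(math is first_import)  # True
```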

8. If you want to filter rows in df where age is greater than 30, which command would you use?¶

  • df[df['age'] > 30]
In [14]:
import pandas as pd
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 40],
    'height': [165.4, 175.3, 168.5, 180.6]
}
df = pd.DataFrame(data)
print(df)
df[df['age'] > 30]
      name  age  height
0    Alice   25   165.4
1      Bob   30   175.3
2  Charlie   35   168.5
3    David   40   180.6
Out[14]:
      name  age  height
2  Charlie   35   168.5
3    David   40   180.6

9. If you want to rename multiple columns in a DataFrame df, which method should you use?¶

  • df.rename(columns={'old_col1': 'new_col1', 'old_col2': 'new_col2'})

If you leave out the columns= keyword, the dictionary is applied to the row labels (the index) instead

New topic: Regular Expressions¶

  • A regular expression (regex or regexp) is a sequence of characters that defines a search pattern.
  • Use case: Regular expressions are used in text processing for searching, matching, and manipulating strings

Examples where regex can play a role¶

  • Find variations in a protein or DNA sequence
    • "MVR???A"
    • "ATG???TAG"
  • American/British spelling, endings and other variants:
    • salpeter, salpetre, saltpeter, nitre, niter or KNO3
    • hemoglobin, haemoglobin, hemoglobins, haemoglobin's
    • catalyze, catalyse, catalyzed...
  • A pattern in a VCF file
    • a digit appearing after a tab

Regex is not unique to Python; it is supported by¶

  • most programming languages,
  • text editors
  • command line tools
  • search engines
In [15]:
!grep "furniture.*sell" ../downloads/blocket_listings.txt 
desk	furniture	sell	2000	2018-01-14
couch	furniture	sell	500	2018-10-05
shoerack	furniture	sell	200	2018-10-24
wardrobe	furniture	sell	300	2018-10-23

Defining a search pattern¶


Common operations¶

Building blocks for creating patterns

  • . matches any character (once)
  • ? repeat previous pattern 0 or 1 times
  • * repeat previous pattern 0 or more times
  • + repeat previous pattern 1 or more times
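The four building blocks above can be tried out directly in Python. This is only a sketch: it uses re.fullmatch from Python's re module (covered later today), and the helper name fully_matches is made up for the illustration.

```python
import re

def fully_matches(pattern, text):
    """Return True if the whole text matches the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# . matches exactly one character
print(fully_matches("a.c", "abc"))         # True
print(fully_matches("a.c", "ac"))          # False

# ? makes the previous character optional (0 or 1 times)
print(fully_matches("colou?r", "color"))   # True
print(fully_matches("colou?r", "colour"))  # True

# * allows 0 or more repeats, + requires at least 1
print(fully_matches("ab*", "a"))           # True
print(fully_matches("ab+", "a"))           # False
print(fully_matches("ab+", "abbb"))        # True
```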

Pattern for matching the colour family¶

colour.*

.* matches everything (including the empty string)!

Pattern for matching the different spellings¶

salt?peter

What about the different endings: er-re?

salt?pet..

matches saltpeter, but the two dots accept any two characters, so it also matches:

"saltpet88"
"salpetin"
"saltpet  "

More common operations - classes of characters¶

  • \w matches any letter or number, and the underscore
  • \d matches any digit
  • \D matches any non-digit
  • \s matches any whitespace (spaces, tabs, ...)
  • \S matches any non-whitespace

\w+

\d+

\s+

More common operations - classes of characters¶

  • [abc] matches a single character defined in this set {a, b, c}
  • [^abc] matches a single character that is not a, b or c

[a-z] matches any single letter between a and z (the lowercase English alphabet).¶

[a-z]+ matches any lowercase English word.¶
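A small sketch of character sets in Python, assuming the re module that is introduced in Session 2. The test strings are made up for the illustration.

```python
import re

# [abc] matches one character from the set: the first hit in "xyzb" is "b"
print(re.compile(r"[abc]").search("xyzb").group())            # b

# [^abc] matches one character that is NOT in the set
print(re.compile(r"[^abc]").search("abcx").group())           # x

# [a-z]+ matches a run of lowercase letters
print(re.compile(r"[a-z]+").search("123 atgaaa 456").group()) # atgaaa

# Upper case letters are outside the range a-z, so nothing is found
print(re.compile(r"[a-z]+").search("ATG"))                    # None
```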

salt?pet[er]+

matches both spellings:

saltpeter
salpetre

and no longer matches:

"saltpet88"
"salpetin"
"saltpet  "

Example - finding patterns in a VCF file¶

1 920760 rs80259304 T C . PASS AA=T;AC=18;AN=120;DP=190;GP=1:930897;BN=131 GT:DP:CB 0/1:1:SM 0/0:4/SM...

  • Find a sample:

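0/0 0/1 1/1 ...

"[01]/[01]" (or "\d/\d") matches a genotype anywhere on the line.

\s[01]/[01]: is stricter: the genotype must come right after whitespace and be followed by a colon.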

Example - finding patterns in a VCF file¶

1 920760 rs80259304 T C . PASS AA=T;AC=18;AN=120;DP=190;GP=1:930897;BN=131 GT:DP:CB 0/1:1:SM 0/0:4/SM...

  • Find all lines containing more than one sample that is homozygous for the alternative allele (1/1).

... 1/1:... ... 1/1:... ...

.*1/1.*1/1.*

.*\s1/1:.*\s1/1:.*
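The two stricter VCF patterns can be sketched in Python as follows. The two data lines are hypothetical and heavily simplified (made-up IDs and values, tab-separated), and the variable names are assumptions for the illustration.

```python
import re

# Hypothetical, simplified VCF data lines; not taken from a real file
vcf_lines = [
    "1\t920760\trs1\tT\tC\t.\tPASS\tAA=T\tGT:DP\t0/1:1\t1/1:4\t1/1:7",
    "1\t920999\trs2\tG\tA\t.\tPASS\tAA=G\tGT:DP\t0/0:3\t1/1:2\t0/1:5",
]

sample = re.compile(r"\s[01]/[01]:")            # one sample genotype
two_hom = re.compile(r".*\s1/1:.*\s1/1:.*")     # at least two 1/1 samples

results = []
for line in vcf_lines:
    n_samples = len(sample.findall(line))       # count the genotype fields
    has_two = two_hom.search(line) is not None  # True if two or more 1/1
    results.append((n_samples, has_two))
    print(n_samples, has_two)
# prints:
# 3 True
# 3 False
```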

Cheat sheet¶

  • . matches any character (once)
  • ? repeat previous pattern 0 or 1 times
  • * repeat previous pattern 0 or more times
  • + repeat previous pattern 1 or more times
  • \w matches any letter or number, and the underscore
  • \d matches any digit
  • \D matches any non-digit
  • \s matches any whitespace (spaces, tabs, ...)
  • \S matches any non-whitespace
  • [abc] matches a single character defined in this set {a, b, c}
  • [^abc] matches a single character that is not a, b or c
  • [a-z] matches any lowercase letter from the English alphabet
  • .* matches anything

  • https://regex101.com/

Day 5, Exercise 1 (~30 min)¶

Practicing regular expressions¶

  • Canvas -> Modules -> Day 5 -> Exercise 1 - day 5

Start the exercise by running


python retester.py

in the downloads folder in a terminal

Take a break after the exercise (~10 min)¶

Session 2¶

  • How to use regex in Python
  • Ex2: Regex using Python
  • PyQuiz 5.1

Regular expressions in Python¶

In [37]:
# Import module
import re
In [38]:
# Define a pattern
p = re.compile('ab*')
p
Out[38]:
re.compile(r'ab*', re.UNICODE)

Searching¶

In [41]:
# Search pattern in string
p = re.compile('ab*')

p.search('abc')
Out[41]:
<re.Match object; span=(0, 2), match='ab'>
In [42]:
print(p.search('cb'))
None
In [43]:
p = re.compile('HELLO')
m = p.search('gsdfgsdfgs  HELLO  __!@£§≈[|ÅÄÖ‚…’fi]')

print(m)
<re.Match object; span=(12, 17), match='HELLO'>

Case insensitiveness¶

In [46]:
# Remember, [a-z]+ matches any lowercase English word
p = re.compile('[a-z]+')
result = p.search('ATGAAA')
print(result)
None
In [49]:
p = re.compile('[a-z]+', re.IGNORECASE)

result = p.search('ATGAAA')
result
Out[49]:
<re.Match object; span=(0, 6), match='ATGAAA'>

The match object¶

In [51]:
p = re.compile('[a-z]+', re.IGNORECASE)

result = p.search('123 ATGAAA 456')
result
Out[51]:
<re.Match object; span=(4, 10), match='ATGAAA'>

result.group(): Return the string matched by the expression

result.start(): Return the starting position of the match

result.end(): Return the ending position of the match

result.span(): Return both (start, end)

In [52]:
result.group()
Out[52]:
'ATGAAA'
In [53]:
result.start()
Out[53]:
4
In [54]:
result.end()
Out[54]:
10
In [55]:
result.span()
Out[55]:
(4, 10)

Zero or more...?¶

In [19]:
p = re.compile('.*HELLO.*')
In [20]:
m = p.search('lots of text  HELLO  more text and characters!!! ^^')
In [21]:
m.group()
Out[21]:
'lots of text  HELLO  more text and characters!!! ^^'

The * is greedy: it matches as much text as possible.
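A short sketch of what greedy means in practice, using a made-up DNA string with two TAG stop codons. The pattern stops at the last TAG, not the first one.

```python
import re

dna = "ATGCCCTAGGGGTAGAA"   # made-up sequence with two TAG triplets

p = re.compile("ATG.*TAG")
m = p.search(dna)

# .* grabs as much as it can, so the match runs to the LAST "TAG"
print(m.group())  # ATGCCCTAGGGGTAG
print(m.span())   # (0, 15)
```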

Finding all the matching patterns¶

In [71]:
# Find all instances of the defined pattern
p = re.compile('HELLO')
matches = p.finditer('lots of text  HELLO  more text  HELLO ... and characters!!! ^^')
print(matches)
<callable_iterator object at 0x7ff202b6efa0>
In [72]:
# Loop through matches
for match in matches:
    print(f'Found {match.group()} at position {match.start()}')
Found HELLO at position 14
Found HELLO at position 32

How to find a full stop?¶

In [79]:
txt = "The first full stop is here: ."
pattern = re.compile('.')

match = pattern.search(txt)
print('"{}" at position {}'.format(match.group(), match.start()))
"T" at position 0
In [85]:
# Print all matches
matches = pattern.finditer(txt)
#for match in matches:
#    print('"{}" at position {}'.format(match.group(), match.start()))
In [86]:
# Use escape character to search
p = re.compile(r'\.')  # raw string, so the backslash reaches the regex engine unchanged

m = p.search(txt)
print('"{}" at position {}'.format(m.group(), m.start()))
"." at position 29

More operations¶

  • \ escaping a character
  • ^ beginning of the string
  • $ end of string
  • | boolean or

^hello$

salt?pet(er|re)|nit(er|re)|KNO3
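A sketch of the anchors and the boolean or in Python. Note that spaces inside a pattern are matched literally, so the alternatives are written without spaces around |. The word list is made up for the illustration.

```python
import re

# ^ and $ tie the pattern to the start and the end of the string
exact = re.compile(r"^hello$")
print(exact.search("hello") is not None)        # True
print(exact.search("hello world") is not None)  # False

# | chooses between alternatives, parentheses limit its reach
names = re.compile(r"salt?pet(er|re)|nit(er|re)|KNO3")
words = ["saltpeter", "salpetre", "nitre", "niter", "KNO3", "salt"]
found = [w for w in words if names.fullmatch(w)]
print(found)  # ['saltpeter', 'salpetre', 'nitre', 'niter', 'KNO3']
```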

Substitution¶

Finally, we can fix our spelling mistakes!¶

In [87]:
txt = "Do it   becuase   I say so,     not becuase you want!"
In [89]:
# Spell the word because correctly
import re
p = re.compile('becuase')
txt = p.sub('because', txt)
print(txt)
Do it   because   I say so,     not because you want!
In [90]:
# Remove additional spaces
p = re.compile(r'\s+')
p.sub(' ', txt)
Out[90]:
'Do it because I say so, not because you want!'

Overview¶

  • Construct regular expressions

    p = re.compile(pattern)
    
  • Searching

    p.search(text)
    
  • Substitution

    p.sub(replacement, text)
    

Typical code structure:

pattern = re.compile( ... )
match = pattern.search('string goes here')
if match:
    print('Match found: ', match.group())
else:
    print('No match')

Regular expressions¶

  • A powerful tool to search and modify text

  • There is much more to read in the docs

  • Note: regex comes in different flavours. If you use it outside Python, there might be small variations in the syntax.

Day 5, Exercise 2 (~30 min)¶

Use regular expressions with Python¶

  • Canvas -> Modules -> Day 5 -> Exercise 2 - day 5

Take a break after the exercise (~10 min)¶


PyQuiz 5.1¶


Lunch¶

Sum up!

Processing files - looping through the lines¶

fh = open('myfile.txt')
for line in fh:
    do_stuff(line)

Store values¶

iterations = 0
information = []

fh = open('myfile.txt', 'r')
for line in fh:
    iterations += 1
    information += do_stuff(line)

Values¶

  • Base types:

    str     "hello"
    int     5
    float   5.2
    bool    True
    
  • Collections:

    list  ["a", "b", "c"]
    dict  {"a": "alligator", "b": "bear", "c": "cat"}
    tuple ("this", "that")
    set   {"drama", "sci-fi"}
    

Assign values¶

iterations = 0
score = 5.2

Compare and membership¶

+, -, *,...   # mathematical
and, or, not  # logical 
==, !=        # (in)equality
<, >, <=, >=  # comparison
in            # membership
In [91]:
value = 4
nextvalue = 1
nextvalue += value
print('nextvalue: ', nextvalue, 'value: ', value)
nextvalue:  5 value:  4
In [40]:
x = 5
y = 7
z = 2
x > 6 and y == 7 or z > 1
Out[40]:
True
In [41]:
(x > 6 and y == 7) or z > 1
Out[41]:
True

Strings¶

Works like a list of characters

In [23]:
mystr = "one"
In [24]:
mystr += " two" # string concatenation
mystr
Out[24]:
'one two'
In [25]:
len(mystr) # get the length
Out[25]:
7
In [26]:
"one" in mystr # membership checking
Out[26]:
True

Strings are immutable¶

In [27]:
mystr = "one"
mystr[1] = "W"
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[27], line 2
      1 mystr = "one"
----> 2 mystr[1] = "W"

TypeError: 'str' object does not support item assignment
In [28]:
mystr = "one"
print(mystr)
mystr = "two"
print(mystr)
one
two
In [30]:
mystr = "one"
print(f"mystr = {mystr}, address = {id(mystr)}")
mystr = "two"
print(f"mystr = {mystr}, address = {id(mystr)}")
mystr = one, address = 4330374000
mystr = two, address = 4414743344

String manipulation¶

s.strip()  # remove unwanted spacing

s.split()  # split line into columns

s.upper(), s.lower()  # change the case
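A minimal sketch of these string methods applied to one tab-separated line, borrowed from the listings file shown in the grep example earlier. The variable names are assumptions for the illustration.

```python
line = "desk\tfurniture\tsell\t2000\t2018-01-14\n"

clean = line.strip()     # drops the trailing newline
columns = clean.split()  # splits on whitespace (here: the tabs)

print(columns)              # ['desk', 'furniture', 'sell', '2000', '2018-01-14']
print(columns[0].upper())   # DESK
print("SELL".lower())       # sell
```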

Regular expressions help you find and replace strings.¶

p = re.compile('A.A.A')
p.search(dnastring)

p = re.compile('T')
p.sub('U', dnastring)
In [92]:
import re

p = re.compile(r'p.*\sp')  # the greedy star!

p.search('a python programmer writes python code').group()
Out[92]:
'python programmer writes p'

Collections¶

Can contain strings, integers, booleans...

  • Mutable: you can add, remove, change values

  • Lists:

    mylist.append('value')
    
  • Dicts:

    mydict['key'] = 'value'
    
  • Sets:

    myset.add('value')
    

Collections¶

  • Test for membership:

    value in myobj
    
  • Check size:

    len(myobj)
    

Lists¶

  • Ordered!
todolist = ["work", "sleep", "eat", "work"]

todolist.sort()
todolist.reverse()
todolist[2]
todolist[-1]
todolist[2:6]
In [101]:
todolist = ["work", "sleep", "eat", "work"]
In [94]:
todolist.sort()
print(todolist)
['eat', 'sleep', 'work', 'work']
In [95]:
todolist.reverse()
print(todolist)
['work', 'work', 'sleep', 'eat']
In [96]:
todolist[2]
Out[96]:
'sleep'
In [99]:
todolist[-1]
Out[99]:
'eat'
In [103]:
todolist[2:]
Out[103]:
['sleep', 'eat']

Dictionaries¶

  • Keys have values
mydict = {"a": "alligator", "b": "bear", "c": "cat"}
counter = {"cats": 55, "dogs": 8}

mydict["a"]
mydict.keys()
mydict.values()
In [104]:
counter = {'cats': 0, 'others': 0}

for animal in ['zebra', 'cat', 'dog', 'cat']:
    if animal == 'cat':
        counter['cats'] += 1
    else:
        counter['others'] += 1
        
counter
Out[104]:
{'cats': 2, 'others': 2}

Sets¶

  • Bag of values

    • No order

    • No duplicates

    • Fast membership checks

    • Logical set operations (union, difference, intersection...)

myset = {"drama", "sci-fi"}

myset.add("comedy")

myset.remove("drama")
In [105]:
todolist = ["work", "sleep", "eat", "work"]

todo_items = set(todolist)
todo_items
Out[105]:
{'eat', 'sleep', 'work'}
In [106]:
todo_items.add("study")
todo_items
Out[106]:
{'eat', 'sleep', 'study', 'work'}
In [107]:
todo_items.add("eat")
todo_items
Out[107]:
{'eat', 'sleep', 'study', 'work'}
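The logical set operations mentioned above (union, intersection, difference) can be sketched like this. The two sets are made up for the illustration, and sorted() is used only to get a stable print order, since sets themselves have no order.

```python
watched = {"drama", "sci-fi", "comedy"}
wishlist = {"comedy", "horror"}

union = watched | wishlist         # values in either set
intersection = watched & wishlist  # values in both sets
difference = watched - wishlist    # values only in watched

print(sorted(union))         # ['comedy', 'drama', 'horror', 'sci-fi']
print(sorted(intersection))  # ['comedy']
print(sorted(difference))    # ['drama', 'sci-fi']
```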

Tuples¶

  • A group (usually two) of values that belong together
tup = (max_length, sequence)
  • An ordered sequence (like lists)
length = tup[0]  # get content at index 0
  • Immutable
In [53]:
tup = (2, 'xy')
tup[0]
Out[53]:
2
In [54]:
tup[0] = 2
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-54-874559a0c62a> in <module>
----> 1 tup[0] = 2

TypeError: 'tuple' object does not support item assignment

Tuples in functions¶

def find_longest_seq(file):
    # some code here...
    return length, sequence

answer = find_longest_seq(filepath)  # the result comes back as one tuple
print('length', answer[0])
print('sequence', answer[1])

length, sequence = find_longest_seq(filepath)  # or unpack it into two variables
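A runnable variant of the same idea. To keep the sketch self-contained it takes a list of strings instead of a file, so this version of find_longest_seq is an assumption made for the illustration, not the function from the exercises.

```python
def find_longest_seq(sequences):
    """Return (length, sequence) for the longest string in a list."""
    longest = max(sequences, key=len)
    return len(longest), longest

# The result is one tuple...
answer = find_longest_seq(["ATG", "ATGAAA", "AT"])
print(answer)  # (6, 'ATGAAA')

# ...that can also be unpacked into two variables
length, sequence = find_longest_seq(["ATG", "ATGAAA", "AT"])
print(length, sequence)  # 6 ATGAAA
```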

Deciding what to do¶

if count > 10:
    print('big')
elif count > 5:
    print('medium')
else:
    print('small')
In [108]:
shopping_list = ['bread', 'egg', 'butter', 'milk']
tired         = True

if len(shopping_list) > 4:
    print('Really need to go shopping!')
elif not tired:
    print('Not tired? Then go shopping!')
else:
    print('Better to stay at home')   
Better to stay at home


Program flow - for loops¶

information = []
fh = open('myfile.txt', 'r')

for line in fh:
    if is_comment(line):
        use_comment(line)
    else:
        information += read_data(line)


Program flow - while loops¶

keep_going = True
information = []
index = 0

while keep_going:
    current_line = lines[index]
    information += read_line(current_line)
    index += 1
    if check_something(current_line):
        keep_going = False


Different types of loops¶

For loop

is a control flow statement that performs operations over a known number of steps.

While loop

is a control flow statement that allows code to be executed repeatedly based on a given Boolean condition.


Which one to use?

For loops - standard for iterations over lists and other iterable objects

While loops - more flexible and can iterate an unspecified number of times

In [56]:
user_input = "thank god it's friday"
for letter in user_input:
    print(letter.upper())
T
H
A
N
K
 
G
O
D
 
I
T
'
S
 
F
R
I
D
A
Y
In [57]:
i = 0
while i < len(user_input):
    letter = user_input[i]
    print(letter.upper())
    i += 1
T
H
A
N
K
 
G
O
D
 
I
T
'
S
 
F
R
I
D
A
Y

Controlling loops¶

  • break - stop the loop
  • continue - go on to the next iteration
In [59]:
user_input = "thank god it's friday"
for letter in user_input:

    if letter == 'd':
        break
    print(letter.upper())
T
H
A
N
K
 
G
O
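The example above shows break. A matching sketch for continue, which skips the rest of the current iteration and moves on to the next one (here: spaces are left out). The variable result is made up for the illustration.

```python
user_input = "thank god it's friday"
result = ""

for letter in user_input:
    if letter == " ":
        continue  # skip spaces and go on to the next letter
    result += letter.upper()

print(result)  # THANKGODIT'SFRIDAY
```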

Watch out!

In [31]:
# DON'T RUN THIS
i = 0
-while i < 10:    
    print(user_input[i])
  Cell In[31], line 3
    -while i < 10:
     ^
SyntaxError: invalid syntax

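While loops may be infinite! Here i is never incremented, so without the deliberate syntax error the condition i < 10 would stay true forever.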

File Input/Output¶

  • In:

    • Read files: fh = open(filename, 'r')
      • for line in fh:
      • fh.read()
      • fh.readlines()
    • Read information from command line: sys.argv[1:]
  • Out:

    • Write files: fh = open(filename, 'w')
      • fh.write(text)
    • Printing: print('my_information')

Input/Output¶

  • Open files should be closed:
    • fh.close()

or use a with statement, which closes the file automatically
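A minimal sketch of writing and reading a file with the with statement. The file is created in a temporary folder so the example does not touch your own files. The file name and its contents are made up.

```python
import os
import tempfile

# Hypothetical file in a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "myfile.txt")

# Write: the file is closed when the indented block ends
with open(path, "w") as fh:
    fh.write("first line\n")
    fh.write("second line\n")

# Read: loop through the lines and strip the newlines
with open(path, "r") as fh:
    lines = [line.strip() for line in fh]

print(lines)      # ['first line', 'second line']
print(fh.closed)  # True
```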

Code structure¶

  • Functions
  • Modules

Functions¶

  • A named piece of code that performs a certain task.


  • Is given a number of input arguments
    • to be used (are in scope) within the function body
  • Returns a result (maybe None)

Functions - keyword arguments¶

def prettyprinter(name, value, delim=":", end=None):
    out = "The " + name + " is " + delim + " " + value
    if end:
        out += end
    return out
  • used to set default values (often None)
  • can be skipped in function calls
  • improve readability
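A short usage sketch for the function above, repeated here so that the block runs on its own. The argument values are made up for the illustration.

```python
def prettyprinter(name, value, delim=":", end=None):
    out = "The " + name + " is " + delim + " " + value
    if end:
        out += end
    return out

# Keyword arguments can be skipped; the defaults are used
print(prettyprinter("title", "Python"))                        # The title is : Python

# They can be set by name
print(prettyprinter("title", "Python", delim="="))             # The title is = Python

# When given by name, their order does not matter
print(prettyprinter("title", "Python", end="!", delim="->"))   # The title is -> Python!
```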

Using your code¶

Any longer pieces of code that have been used and will be re-used should be saved

  • Save it as a .py file

  • To run it: python3 mycode.py or python mycode.py

  • Import it: import mycode

Documentation and comments¶

""" This is a doc-string explaining what the purpose of this function/module is """
# This is a comment that helps understanding the code
  • Comments will help you
  • Undocumented code rarely gets used
  • Try to keep your code readable: use informative variable and function names

Why programming?¶

Endless possibilities!

  • reverse complement DNA
  • custom filtering of VCF files
  • plotting of results
  • all Excel stuff!

Why programming?¶

  • Computers are fast
  • Computers don't get bored
  • Computers don't get sloppy
  • Create reproducible results
    • for you and for others to use
  • Extract large amounts of information

Final advice¶

  • Take a moment to think before you start coding
    • use pseudocode
    • use top-down programming
    • use paper and pen
    • take breaks
  • You know the basics - don't be afraid to try, it's the only way to learn
  • You will get faster

Final advice (for real)¶

  • Getting help
    • ask colleagues
    • try talking about your problem (get a rubber duck https://en.wikipedia.org/wiki/Rubber_duck_debugging)
    • search the web
    • NBIS drop-ins

Now you know Python!¶


🎉

Well done!
Just a small quiz to finish the day¶